Emotional Speech Classification¶

Hayden Hoopes

This analysis explores whether deep learning models can accurately predict the emotion expressed in a spoken phrase. The goal is to build a model that assigns each audio file to one of eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, or surprised. To do so, I will build several models, each of which I expect to be more accurate than the last. The models that I plan to create are as follows:

  • Naïve classifier (baseline model)
  • Artificial Neural Network (ANN)
  • Convolutional Neural Network (CNN)
  • Transfer Learning (using the OpenL3 model from New York University's Music and Audio Research Lab)

About The Data¶

The RAVDESS data set contains 1440 files: 60 trials per actor x 24 actors = 1440. It features 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.

File naming convention¶

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

Filename identifiers¶

Modality (01 = full-AV, 02 = video-only, 03 = audio-only).

Vocal channel (01 = speech, 02 = song).

Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).

Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.

Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").

Repetition (01 = 1st repetition, 02 = 2nd repetition).

Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).

Filename example: 03-01-06-01-02-01-12.wav

  • Audio-only (03)
  • Speech (01)
  • Fearful (06)
  • Normal intensity (01)
  • Statement "dogs" (02)
  • 1st repetition (01)
  • 12th actor (12), female, as the actor ID number is even
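
The naming convention above can be decoded mechanically. The following is a minimal sketch, not part of the original notebook: the function name `parse_ravdess_filename` and the `Actor_12` folder in the example path are hypothetical, and the lookup tables are transcribed from the identifier list above.

```python
import os

# Lookup tables transcribed from the identifier list above
MODALITY = {1: 'full-AV', 2: 'video-only', 3: 'audio-only'}
VOCAL_CHANNEL = {1: 'speech', 2: 'song'}
EMOTION = {1: 'neutral', 2: 'calm', 3: 'happy', 4: 'sad',
           5: 'angry', 6: 'fearful', 7: 'disgust', 8: 'surprised'}
INTENSITY = {1: 'normal', 2: 'strong'}
STATEMENT = {1: 'Kids are talking by the door',
             2: 'Dogs are sitting by the door'}


def parse_ravdess_filename(path):
    """Split a RAVDESS filename into its seven labelled fields (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = [int(p) for p in stem.split('-')]
    if len(parts) != 7:
        raise ValueError(f'Expected 7 fields, got {len(parts)}: {path}')
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        'modality': MODALITY[modality],
        'vocal_channel': VOCAL_CHANNEL[channel],
        'emotion': EMOTION[emotion],
        'intensity': INTENSITY[intensity],
        'statement': STATEMENT[statement],
        'repetition': repetition,
        'actor': actor,
        # Odd-numbered actors are male, even-numbered actors are female
        'gender': 'male' if actor % 2 == 1 else 'female',
    }


# The folder name in this path is an assumption made for illustration
info = parse_ravdess_filename('audio_speech_actors_01-24/Actor_12/03-01-06-01-02-01-12.wav')
print(info)
```

Using `os.path.basename` rather than splitting on a specific separator keeps the parsing independent of the operating system's path format.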

Data Preparation¶

The first thing I need to do is get the data into a format that a neural network can process. I'll place each of the file paths and their associated labels into lists that I can then use to create a TensorFlow Dataset object.

In [2]:
import os
import warnings
warnings.filterwarnings("ignore")

# Get all of the file paths into an array and the class labels (emotions) into an array of the same size
data_path = 'audio_speech_actors_01-24/'

file_paths = []
labels = []

label_dict = {
    1: 'neutral',
    2: 'calm',
    3: 'happy',
    4: 'sad',
    5: 'angry',
    6: 'fearful',
    7: 'disgust',
    8: 'surprised'
}

for actor in os.listdir(data_path):
    class_path = os.path.join(data_path, actor)
    for file in os.listdir(class_path):
        labels.append(int(file.split('-')[2])) # This extracts the class label from the file name and appends it to the labels list
        file_paths.append(os.path.join(class_path, file)) # This adds the file path to the file_paths list

Create Data Sets¶

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
import librosa

train_paths, val_paths, train_labels, val_labels = train_test_split(file_paths, labels, test_size=0.2, random_state=1)

train_labels = np.array(train_labels)
val_labels = np.array(val_labels)

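def load_and_process_audio(file_path, label):
    # librosa.load resamples to 22,050 Hz by default, so 75,000 samples is about 3.4 seconds
    audio, sample_rate = librosa.load(file_path)
    audio = audio[:75000] # Truncate audio longer than 75000 samples
    audio = np.concatenate((np.zeros(75000-len(audio)), audio), axis=0) # Left-pad shorter audio with zeros

    # Square root of the mel power spectrogram; the shape is (128, 147) with librosa's defaults.
    # Note that sr=44100 here differs from the 22,050 Hz rate the audio was loaded at.
    spectrogram = np.sqrt(librosa.feature.melspectrogram(y=audio, sr=44100))

    return spectrogram, label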

Now that the function for loading the audio files as spectrograms is complete, I can load the audio files into TensorFlow Dataset objects.
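
Because every clip is truncated or padded to 75,000 samples, every spectrogram has the same shape. The sketch below works out that shape by hand. It assumes librosa's default settings (`n_mels=128`, `hop_length=512`, centered framing), and the helper name `mel_spectrogram_shape` is hypothetical.

```python
def mel_spectrogram_shape(n_samples, n_mels=128, hop_length=512):
    """Shape of a mel spectrogram under librosa's default, centered framing."""
    # With centered frames, there is one frame per hop plus one extra frame
    n_frames = 1 + n_samples // hop_length
    return n_mels, n_frames


shape = mel_spectrogram_shape(75000)
flattened_size = shape[0] * shape[1]
print(shape, flattened_size)
```

The result, 128 mel bands by 147 frames, is the input shape used by the convolutional model later on, and the flattened size of 18,816 is the input width of the dense network.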

In [4]:
import tensorflow as tf

train_data = []

for file, label in zip(train_paths, train_labels):
    spectrogram, label = load_and_process_audio(file, label)
    train_data.append(spectrogram)

val_data = []

for file, label in zip(val_paths, val_labels):
    spectrogram, label = load_and_process_audio(file, label)
    val_data.append(spectrogram)

sparse_train_labels = np.zeros((train_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (train_labels == value).astype(int)

sparse_val_labels = np.zeros((val_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)

batch_size = 32

train_dataset = tf.data.Dataset.from_tensor_slices((train_data, sparse_train_labels))
train_dataset = train_dataset.shuffle(buffer_size=len(train_paths)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)
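
The two loops above build one-hot label matrices (despite the `sparse_` prefix in the variable names). As a hedged alternative, the same encoding can be produced in one step by indexing an identity matrix. The helper name `one_hot` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np


def one_hot(labels, num_classes=8):
    """One-hot encode labels numbered 1..num_classes (hypothetical helper)."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k + 1
    return np.eye(num_classes)[labels - 1]


example = one_hot([1, 3, 8])
print(example)
```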

Visualizing Some Different Audio Samples¶

Now that we have the datasets created, let's visualize some of the different audio samples. To see the differences between emotions, we will stick to visualizing only emotions that use the phrase "Kids are talking by the door".

In [5]:
import matplotlib.pyplot as plt

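# Use os.path.basename so the filter works with any path separator, not only Windows backslashes
phrase_1_audio_files = [(path, label) for path, label in zip(file_paths, labels) if os.path.basename(path).split('-')[4] == '01']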
In [6]:
phrase_1_audio_file_paths = [i[0] for i in phrase_1_audio_files]
phrase_1_labels = [i[1] for i in phrase_1_audio_files]

new_class_positions = []
new_classes = []

for i, label in enumerate(phrase_1_labels):
    if label not in new_classes:
        new_class_positions.append(i)
        new_classes.append(label)
In [7]:
fig, axs = plt.subplots(2, 4, figsize=(12, 6))

for i in range(8):
    spectrogram, label = load_and_process_audio(phrase_1_audio_file_paths[new_class_positions[i]], phrase_1_labels[new_class_positions[i]])
    axs[i//4, i%4].imshow(tf.transpose(spectrogram))
    axs[i//4, i%4].set_title(f'Spectrogram: \'{label_dict[label]}\'')

plt.show()

Baseline Model¶

Next, I'll establish a baseline. The class proportions computed below from the labels I extracted earlier show that always predicting any single non-neutral class would score 13.3% accuracy (6.7% for neutral), while guessing uniformly at random among the eight classes would score 12.5% on average. Roughly 13% is therefore the figure to beat.

Surprisingly, after flattening each spectrogram into a single feature vector, a simple decision tree model with no tuning classified the recordings in the validation set with about 30.6% accuracy. This is more than double the accuracy of naive guessing, but I think that better models can be created to beat it.
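
To make the naive baselines concrete, here is a small sketch that computes them from a list of labels. The function names are hypothetical, and `example_labels` is a synthetic list that only mirrors the class proportions in the output below (96 neutral recordings and 192 for each of the other seven classes); it is not the real data.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def uniform_guess_accuracy(labels):
    """Expected accuracy when guessing uniformly at random among the classes."""
    return 1 / len(set(labels))


# Synthetic labels: 96 neutral recordings plus 192 for each of classes 2 through 8
example_labels = [1] * 96 + [c for c in range(2, 9) for _ in range(192)]

print(len(example_labels))
print(round(majority_class_accuracy(example_labels), 4))
print(uniform_guess_accuracy(example_labels))
```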

In [8]:
from collections import Counter
import pandas as pd

pd.Series(Counter(labels)) / len(labels)
Out[8]:
1    0.066667
2    0.133333
3    0.133333
4    0.133333
5    0.133333
6    0.133333
7    0.133333
8    0.133333
dtype: float64
In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

flattened_train_data = np.array([s.flatten() for s in train_data])
flattened_val_data = np.array([s.flatten() for s in val_data])

dt = DecisionTreeClassifier()
dt.fit(flattened_train_data, train_labels)
predictions = dt.predict(flattened_val_data)
print(classification_report(val_labels, predictions))
print(accuracy_score(val_labels, predictions))
              precision    recall  f1-score   support

           1       0.23      0.35      0.28        17
           2       0.48      0.50      0.49        32
           3       0.17      0.16      0.16        44
           4       0.17      0.17      0.17        41
           5       0.37      0.56      0.44        27
           6       0.31      0.29      0.30        45
           7       0.41      0.28      0.33        46
           8       0.34      0.31      0.32        36

    accuracy                           0.31       288
   macro avg       0.31      0.33      0.31       288
weighted avg       0.31      0.31      0.30       288

0.3055555555555556

Artificial Neural Network¶

Next, I'll use an artificial neural network (ANN) to try to improve on the decision tree's accuracy. While I don't think that the neural network will perform significantly better than the baseline model, I do expect it to pick up nonlinear patterns in the data set that could give it additional classification power.

As shown below, the simple neural network has 2,419,176 parameters and reaches 45% accuracy on the validation set. This is better than the decision tree model, but the network likely still isn't seeing localized patterns, because the two-dimensional structure of the spectrogram is lost when it is flattened into a one-dimensional array.
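
The parameter total can be checked by hand: each dense layer has one weight per input-output pair plus one bias per output. The following sketch uses a hypothetical helper, `dense_param_count`, with the layer widths taken from the model definition below.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 18,816 flattened spectrogram values, then layers of 128, 64, 32, and 8 units
total = dense_param_count([18816, 128, 64, 32, 8])
print(total)
```

Almost all of the parameters (2,408,576) sit in the first layer, which connects every spectrogram value to each of the 128 hidden units.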

From the training log and the graphs, the validation loss reaches its minimum at epoch 5 (1.71) and climbs almost steadily afterward, ending near 4.0, while the training accuracy approaches 100%. This indicates that the model overfits almost immediately. After epoch 5, the validation accuracy fluctuates between roughly 43% and 50%, so the additional epochs add little.

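According to the classification report, the model does best on class 2 (calm), correctly labeling 30 of the 32 calm recordings, although it also predicts calm far more often than the class occurs (63 predictions). It does worst on class 1 (neutral) and class 4 (sad), recovering only 3 of 17 and 7 of 41 recordings, respectively. Perhaps convolutional neural networks will help this model improve its ability to predict these and all other classes.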
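
Since `ModelCheckpoint` with `save_best_only=True` monitors `val_loss` by default, the saved model is the one from the epoch with the lowest validation loss. The sketch below shows how that epoch can be read off a history dictionary. The helper `best_epoch` is hypothetical, and the values are copied from the first eight epochs of the training log below.

```python
def best_epoch(history, monitor='val_loss'):
    """Return the 1-based epoch at which the monitored value is lowest."""
    values = history[monitor]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1


# Values copied from the first eight epochs of the ANN training log
history_excerpt = {
    'val_loss': [1.8686, 1.7147, 1.8072, 1.7492, 1.7125, 1.9694, 2.2183, 2.4538],
    'val_accuracy': [0.2292, 0.3785, 0.3785, 0.3958, 0.4549, 0.4757, 0.4618, 0.4722],
}

epoch = best_epoch(history_excerpt)
print(epoch, history_excerpt['val_accuracy'][epoch - 1])
```

The validation accuracy at that epoch, about 45%, matches the accuracy in the classification report, even though some later epochs reach slightly higher validation accuracy with much higher loss.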

In [10]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(units=128, activation='relu', input_shape=(18816,)),
    layers.Dense(units=64, activation='relu'),
    layers.Dense(units=32, activation='relu'),
    layers.Dense(units=8, activation='softmax')
])

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 128)               2408576   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 8)                 264       
                                                                 
=================================================================
Total params: 2419176 (9.23 MB)
Trainable params: 2419176 (9.23 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [11]:
callbacks = [keras.callbacks.ModelCheckpoint('ann.keras', save_best_only=True)]
history = model.fit(flattened_train_data, sparse_train_labels, validation_data=(flattened_val_data, sparse_val_labels), epochs=25, batch_size=32, callbacks=callbacks)
Epoch 1/25
36/36 [==============================] - 1s 13ms/step - loss: 1.9496 - accuracy: 0.1970 - val_loss: 1.8686 - val_accuracy: 0.2292
Epoch 2/25
36/36 [==============================] - 0s 9ms/step - loss: 1.6300 - accuracy: 0.3950 - val_loss: 1.7147 - val_accuracy: 0.3785
Epoch 3/25
36/36 [==============================] - 0s 8ms/step - loss: 1.3318 - accuracy: 0.5295 - val_loss: 1.8072 - val_accuracy: 0.3785
Epoch 4/25
36/36 [==============================] - 0s 8ms/step - loss: 1.0776 - accuracy: 0.6328 - val_loss: 1.7492 - val_accuracy: 0.3958
Epoch 5/25
36/36 [==============================] - 0s 9ms/step - loss: 0.8656 - accuracy: 0.7179 - val_loss: 1.7125 - val_accuracy: 0.4549
Epoch 6/25
36/36 [==============================] - 0s 7ms/step - loss: 0.6912 - accuracy: 0.7977 - val_loss: 1.9694 - val_accuracy: 0.4757
Epoch 7/25
36/36 [==============================] - 0s 8ms/step - loss: 0.5592 - accuracy: 0.8429 - val_loss: 2.2183 - val_accuracy: 0.4618
Epoch 8/25
36/36 [==============================] - 0s 7ms/step - loss: 0.4474 - accuracy: 0.8776 - val_loss: 2.4538 - val_accuracy: 0.4722
Epoch 9/25
36/36 [==============================] - 0s 7ms/step - loss: 0.3720 - accuracy: 0.8993 - val_loss: 2.5328 - val_accuracy: 0.4514
Epoch 10/25
36/36 [==============================] - 0s 8ms/step - loss: 0.2732 - accuracy: 0.9332 - val_loss: 2.4485 - val_accuracy: 0.5035
Epoch 11/25
36/36 [==============================] - 0s 7ms/step - loss: 0.2249 - accuracy: 0.9410 - val_loss: 2.9475 - val_accuracy: 0.4826
Epoch 12/25
36/36 [==============================] - 0s 8ms/step - loss: 0.2047 - accuracy: 0.9618 - val_loss: 2.8294 - val_accuracy: 0.4514
Epoch 13/25
36/36 [==============================] - 0s 7ms/step - loss: 0.2024 - accuracy: 0.9575 - val_loss: 2.9124 - val_accuracy: 0.4757
Epoch 14/25
36/36 [==============================] - 0s 7ms/step - loss: 0.1161 - accuracy: 0.9783 - val_loss: 3.1302 - val_accuracy: 0.4965
Epoch 15/25
36/36 [==============================] - 0s 7ms/step - loss: 0.1635 - accuracy: 0.9714 - val_loss: 3.0397 - val_accuracy: 0.4931
Epoch 16/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0842 - accuracy: 0.9844 - val_loss: 3.2675 - val_accuracy: 0.4583
Epoch 17/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0602 - accuracy: 0.9939 - val_loss: 3.4523 - val_accuracy: 0.4722
Epoch 18/25
36/36 [==============================] - 0s 8ms/step - loss: 0.0806 - accuracy: 0.9905 - val_loss: 3.5126 - val_accuracy: 0.5000
Epoch 19/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0304 - accuracy: 0.9965 - val_loss: 4.1639 - val_accuracy: 0.4965
Epoch 20/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0544 - accuracy: 0.9878 - val_loss: 4.2558 - val_accuracy: 0.4306
Epoch 21/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0131 - accuracy: 1.0000 - val_loss: 4.2340 - val_accuracy: 0.4444
Epoch 22/25
36/36 [==============================] - 0s 7ms/step - loss: 0.1419 - accuracy: 0.9783 - val_loss: 4.7363 - val_accuracy: 0.4444
Epoch 23/25
36/36 [==============================] - 0s 7ms/step - loss: 0.0533 - accuracy: 0.9913 - val_loss: 4.3147 - val_accuracy: 0.4618
Epoch 24/25
36/36 [==============================] - 0s 8ms/step - loss: 0.0083 - accuracy: 1.0000 - val_loss: 4.5180 - val_accuracy: 0.4583
Epoch 25/25
36/36 [==============================] - 0s 8ms/step - loss: 0.0933 - accuracy: 0.9852 - val_loss: 3.9724 - val_accuracy: 0.4861
In [12]:
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
In [13]:
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
In [14]:
test_model = keras.models.load_model('ann.keras')
predicted = test_model.predict(flattened_val_data)
predicted = np.argmax(predicted, axis=1)+1 # Convert the argmax index (0-7) back to a class label (1-8)
print(classification_report(val_labels, predicted)) # classification_report expects (y_true, y_pred)
9/9 [==============================] - 0s 3ms/step
              precision    recall  f1-score   support

           1       0.30      0.18      0.22        17
           2       0.48      0.94      0.63        32
           3       0.35      0.50      0.42        44
           4       0.28      0.17      0.21        41
           5       0.49      0.70      0.58        27
           6       0.60      0.27      0.37        45
           7       0.66      0.50      0.57        46
           8       0.44      0.42      0.43        36

    accuracy                           0.45       288
   macro avg       0.45      0.46      0.43       288
weighted avg       0.46      0.45      0.43       288

Convolutional Neural Network¶

In this step, I will build a convolutional neural network (CNN) that treats the spectrograms of the audio files as images and learns localized features from them. Hopefully, this CNN will outperform both the decision tree model (31% accuracy) and the artificial neural network model (45% accuracy).

In the end, the CNN also overfit: the validation loss reaches its minimum at epoch 6 (1.57) and then climbs sharply, as the graph shows. The validation accuracy does keep rising in later epochs, peaking at about 57% at epochs 23 and 24, but the growing validation loss suggests that the model is fitting features that aren't really there and becoming overconfident in predictions that do not generalize to new data. Because the checkpoint keeps the model with the lowest validation loss, the CNN evaluated below is the one from epoch 6, which produced an accuracy score of just 42%.

I am actually quite surprised that a regular artificial neural network outperformed the convolutional neural network. This could be due to random chance, or the spectrograms may simply not provide enough information (especially regarding localized patterns) to classify emotions better than a simple array of values.
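
The layer shapes and parameter counts in the model summary below can be reproduced with simple arithmetic. This sketch assumes what the model definition specifies: 3x3 convolutions without padding (each shrinks height and width by 2) and 2x2 max pooling (each halves them, rounding down) after the first four convolutions. The helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output length of an unpadded convolution along one axis."""
    return size - kernel + 1


def pool(size, window=2):
    """Output length of non-overlapping max pooling along one axis."""
    return size // window


height, width, channels_in = 128, 147, 1
total_params = 0
filters_per_layer = [32, 64, 128, 256, 256]

for i, filters in enumerate(filters_per_layer):
    height, width = conv_valid(height), conv_valid(width)
    # Each filter has 3 * 3 * channels_in weights plus one bias
    total_params += (3 * 3 * channels_in + 1) * filters
    channels_in = filters
    if i < len(filters_per_layer) - 1:  # No pooling after the last convolution
        height, width = pool(height), pool(width)

flattened = height * width * channels_in
total_params += (flattened + 1) * 8  # Dense output layer with 8 classes

print((height, width, channels_in), flattened, total_params)
```

Although the CNN looks deeper, it has fewer than half the parameters of the dense network, because convolutional filters are shared across the whole spectrogram.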

In [15]:
inputs = keras.Input(shape=(128, 147, 1), name='Input')
x = layers.Conv2D(filters=32, kernel_size=3, activation='relu', name='convolution_layer_1')(inputs)
x = layers.MaxPooling2D(pool_size=2, name='pooling_1')(x)
x = layers.Conv2D(filters=64, kernel_size=3, activation='relu', name='convolution_layer_2')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_2')(x)
x = layers.Conv2D(filters=128, kernel_size=3, activation='relu', name='convolution_layer_3')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_3')(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_4')(x)
x = layers.MaxPooling2D(pool_size=2, name='pooling_4')(x)
x = layers.Conv2D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_5')(x)
x = layers.Flatten()(x)

outputs = layers.Dense(8, activation='softmax', name='output')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name='base_cnn')

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
Model: "base_cnn"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 Input (InputLayer)          [(None, 128, 147, 1)]     0         
                                                                 
 convolution_layer_1 (Conv2  (None, 126, 145, 32)      320       
 D)                                                              
                                                                 
 pooling_1 (MaxPooling2D)    (None, 63, 72, 32)        0         
                                                                 
 convolution_layer_2 (Conv2  (None, 61, 70, 64)        18496     
 D)                                                              
                                                                 
 pooling_2 (MaxPooling2D)    (None, 30, 35, 64)        0         
                                                                 
 convolution_layer_3 (Conv2  (None, 28, 33, 128)       73856     
 D)                                                              
                                                                 
 pooling_3 (MaxPooling2D)    (None, 14, 16, 128)       0         
                                                                 
 convolution_layer_4 (Conv2  (None, 12, 14, 256)       295168    
 D)                                                              
                                                                 
 pooling_4 (MaxPooling2D)    (None, 6, 7, 256)         0         
                                                                 
 convolution_layer_5 (Conv2  (None, 4, 5, 256)         590080    
 D)                                                              
                                                                 
 flatten (Flatten)           (None, 5120)              0         
                                                                 
 output (Dense)              (None, 8)                 40968     
                                                                 
=================================================================
Total params: 1018888 (3.89 MB)
Trainable params: 1018888 (3.89 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In [16]:
callbacks = [keras.callbacks.ModelCheckpoint('cnn.keras', save_best_only=True, monitor='val_loss')]
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, callbacks=callbacks) # The datasets are already batched, so batch_size is not passed here
Epoch 1/25
36/36 [==============================] - 4s 104ms/step - loss: 1.8681 - accuracy: 0.2552 - val_loss: 1.9079 - val_accuracy: 0.2882
Epoch 2/25
36/36 [==============================] - 4s 100ms/step - loss: 1.7166 - accuracy: 0.3568 - val_loss: 1.8403 - val_accuracy: 0.3194
Epoch 3/25
36/36 [==============================] - 4s 100ms/step - loss: 1.6283 - accuracy: 0.3594 - val_loss: 1.6883 - val_accuracy: 0.3611
Epoch 4/25
36/36 [==============================] - 4s 102ms/step - loss: 1.5379 - accuracy: 0.4089 - val_loss: 1.7005 - val_accuracy: 0.3472
Epoch 5/25
36/36 [==============================] - 4s 102ms/step - loss: 1.4286 - accuracy: 0.4540 - val_loss: 1.8021 - val_accuracy: 0.3438
Epoch 6/25
36/36 [==============================] - 4s 102ms/step - loss: 1.3505 - accuracy: 0.5061 - val_loss: 1.5675 - val_accuracy: 0.4167
Epoch 7/25
36/36 [==============================] - 4s 99ms/step - loss: 1.2124 - accuracy: 0.5590 - val_loss: 1.7132 - val_accuracy: 0.4236
Epoch 8/25
36/36 [==============================] - 4s 106ms/step - loss: 1.0518 - accuracy: 0.6128 - val_loss: 1.8883 - val_accuracy: 0.4271
Epoch 9/25
36/36 [==============================] - 4s 105ms/step - loss: 0.9699 - accuracy: 0.6458 - val_loss: 1.6499 - val_accuracy: 0.4757
Epoch 10/25
36/36 [==============================] - 4s 101ms/step - loss: 0.8401 - accuracy: 0.6944 - val_loss: 2.3165 - val_accuracy: 0.4583
Epoch 11/25
36/36 [==============================] - 4s 102ms/step - loss: 0.7120 - accuracy: 0.7587 - val_loss: 3.0223 - val_accuracy: 0.3958
Epoch 12/25
36/36 [==============================] - 4s 104ms/step - loss: 0.6233 - accuracy: 0.7899 - val_loss: 2.7621 - val_accuracy: 0.4306
Epoch 13/25
36/36 [==============================] - 4s 105ms/step - loss: 0.5644 - accuracy: 0.8212 - val_loss: 2.6264 - val_accuracy: 0.4514
Epoch 14/25
36/36 [==============================] - 4s 101ms/step - loss: 0.4248 - accuracy: 0.8507 - val_loss: 2.2267 - val_accuracy: 0.5035
Epoch 15/25
36/36 [==============================] - 4s 101ms/step - loss: 0.3680 - accuracy: 0.8689 - val_loss: 2.9568 - val_accuracy: 0.4931
Epoch 16/25
36/36 [==============================] - 4s 101ms/step - loss: 0.3448 - accuracy: 0.8967 - val_loss: 4.6338 - val_accuracy: 0.4792
Epoch 17/25
36/36 [==============================] - 4s 101ms/step - loss: 0.3116 - accuracy: 0.9167 - val_loss: 3.5512 - val_accuracy: 0.5000
Epoch 18/25
36/36 [==============================] - 4s 101ms/step - loss: 0.2173 - accuracy: 0.9401 - val_loss: 4.4431 - val_accuracy: 0.5208
Epoch 19/25
36/36 [==============================] - 4s 104ms/step - loss: 0.3077 - accuracy: 0.9436 - val_loss: 3.8495 - val_accuracy: 0.5521
Epoch 20/25
36/36 [==============================] - 4s 103ms/step - loss: 0.2302 - accuracy: 0.9280 - val_loss: 3.1742 - val_accuracy: 0.5417
Epoch 21/25
36/36 [==============================] - 4s 103ms/step - loss: 0.1583 - accuracy: 0.9696 - val_loss: 4.1780 - val_accuracy: 0.5486
Epoch 22/25
36/36 [==============================] - 4s 106ms/step - loss: 0.1756 - accuracy: 0.9557 - val_loss: 3.9359 - val_accuracy: 0.5347
Epoch 23/25
36/36 [==============================] - 4s 105ms/step - loss: 0.0883 - accuracy: 0.9800 - val_loss: 4.0909 - val_accuracy: 0.5729
Epoch 24/25
36/36 [==============================] - 4s 104ms/step - loss: 0.1403 - accuracy: 0.9661 - val_loss: 4.7601 - val_accuracy: 0.5729
Epoch 25/25
36/36 [==============================] - 4s 103ms/step - loss: 0.0966 - accuracy: 0.9766 - val_loss: 4.6087 - val_accuracy: 0.5278
In [17]:
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
In [18]:
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
In [19]:
test_model = keras.models.load_model('cnn.keras')
predicted = test_model.predict(val_dataset)
predicted = np.argmax(predicted, axis=1)+1 # Convert the argmax index (0-7) back to a class label (1-8)
print(classification_report(val_labels, predicted)) # classification_report expects (y_true, y_pred)
9/9 [==============================] - 0s 30ms/step
              precision    recall  f1-score   support

           1       0.23      0.29      0.26        17
           2       0.33      0.75      0.46        32
           3       0.40      0.18      0.25        44
           4       0.26      0.15      0.19        41
           5       0.75      0.44      0.56        27
           6       0.48      0.47      0.47        45
           7       0.52      0.52      0.52        46
           8       0.44      0.56      0.49        36

    accuracy                           0.42       288
   macro avg       0.43      0.42      0.40       288
weighted avg       0.43      0.42      0.40       288

Convolutional Neural Network (With Image Augmentation)¶

Since I was a little disappointed with the performance of the model in the previous step, I would like to add augmented spectrograms to the training set to give the model more data to learn from. Hopefully, this will help the model generalize, lowering the validation loss and raising the validation accuracy.

The augmentation performed on the spectrograms includes time masking (zeroing out spans of time frames), frequency masking (zeroing out bands of frequencies), and pitch shifting (altering the pitch of the audio before the spectrogram is computed). With luck, these augmentations will give the model additional, more varied data from which to learn patterns that identify emotions.

In the end, the model that used augmented data reached a validation accuracy of 51%, which is much better than the previous CNN model with an accuracy of 42%. Thus, it appears that the data augmentation worked!
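
As a hedged illustration of the masking idea, the sketch below applies time and frequency masks to a copy of a spectrogram using a seeded random generator, so the original array is left untouched and the result is reproducible. The function `mask_spectrogram` and its parameters are hypothetical; the notebook's own functions, defined next, modify the array in place and use different mask sizes.

```python
import numpy as np


def mask_spectrogram(spectrogram, rng, num_masks=2, max_time=20, max_freq=10):
    """Return a copy with random time spans and frequency bands set to zero."""
    masked = spectrogram.copy()
    n_freq, n_time = masked.shape
    for _ in range(num_masks):
        # Time mask: zero out t consecutive columns starting at t0
        t = rng.integers(1, max_time + 1)
        t0 = rng.integers(0, n_time - t + 1)
        masked[:, t0:t0 + t] = 0
        # Frequency mask: zero out f consecutive rows starting at f0
        f = rng.integers(1, max_freq + 1)
        f0 = rng.integers(0, n_freq - f + 1)
        masked[f0:f0 + f, :] = 0
    return masked


rng = np.random.default_rng(0)
original = np.ones((128, 147))  # Stand-in for a real spectrogram
augmented = mask_spectrogram(original, rng)
print(int((augmented == 0).sum()), 'values masked out of', augmented.size)
```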

In [20]:
def time_masking(spectrogram, num_masks=4):
    # This function picks random windows of time frames (columns) in the spectrogram and sets them to 0, introducing noise into the data set
    for i in range(num_masks):
        t = np.random.randint(15, 50)
        t0 = np.random.randint(0, spectrogram.shape[1] - t)
        spectrogram[:, t0:t0+t] = 0
    return spectrogram

def frequency_masking(spectrogram, num_masks=4):
    # This function is the frequency-axis counterpart of time masking: it sets random bands of frequencies (rows) to 0 so that the model does not see them
    for i in range(num_masks):
        f = np.random.randint(5, 15)
        f0 = np.random.randint(0, spectrogram.shape[0] - f)
        spectrogram[f0:f0+f, :] = 0
    return spectrogram

def load_and_process_augmented_audio(file_path, label):
    audio, sample_rate = librosa.load(file_path)
    audio = audio[:75000] # Truncate audio longer than 75000 samples
    audio = np.concatenate((np.zeros(75000-len(audio)), audio), axis=0) # Left-pad shorter audio with zeros

    audios = [audio]
    
    # Return the original audio plus four pitch-shifted copies for each file in the data set.
    # Each shift is applied to the previously shifted audio, so the cumulative shifts are 0, 1, 3, and 6 semitones.
    for i in range(4):
        audio = librosa.effects.pitch_shift(audio, sr=44100, n_steps=i) # do some pitch shifting
        audios.append(audio)

    spectrograms = [np.sqrt(librosa.feature.melspectrogram(y=a, sr=44100)) for a in audios]

    # Mask copies 1 through 3; the original (index 0) and the last copy (index 4) are left unmasked
    for i in range(1,4):
        spectrograms[i] = time_masking(spectrograms[i])
        spectrograms[i] = frequency_masking(spectrograms[i])
        
    return spectrograms, [label]*5

augmented_train_data = []
augmented_train_labels = []

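for file, label in zip(train_paths, train_labels):
    # Use a separate name so the global `labels` list from the data preparation step is not overwritten
    spectrograms, aug_labels = load_and_process_augmented_audio(file, label)
    augmented_train_data.extend(spectrograms)
    augmented_train_labels.extend(aug_labels)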

val_data = []

for file, label in zip(val_paths, val_labels):
    spectrogram, label = load_and_process_audio(file, label)
    val_data.append(spectrogram)
    

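augmented_train_labels = np.array(augmented_train_labels) # Convert to an array so the comparison below is explicitly element-wise
sparse_train_labels = np.zeros((len(augmented_train_labels),8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (augmented_train_labels == value).astype(int)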

sparse_val_labels = np.zeros((len(val_labels),8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)

batch_size = 32

train_dataset = tf.data.Dataset.from_tensor_slices((augmented_train_data, sparse_train_labels))
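# Size the shuffle buffer to the augmented data set (five spectrograms per file) so that it is shuffled fully
train_dataset = train_dataset.shuffle(buffer_size=len(augmented_train_data)).batch(batch_size)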

val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)

Let's visualize some of the augmented spectrograms. The first image on the left is the original spectrogram.

In [21]:
fig, axs = plt.subplots(1, 5, figsize=(16, 10))

for i in range(5):
    spectrogram = augmented_train_data[i]
    label = augmented_train_labels[i]
    axs[i].imshow(tf.transpose(spectrogram))
    axs[i].set_title(f'Spectrogram: \'{label_dict[label]}\'')

plt.show()
In [22]:
inputs = keras.Input(shape=(128, 147), name='Input')
x = layers.Conv1D(filters=512, kernel_size=3, activation='relu', name='convolution_layer_1')(inputs)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_1')(x)

x = layers.Conv1D(filters=256, kernel_size=3, activation='relu', name='convolution_layer_2')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_2')(x)

x = layers.Conv1D(filters=128, kernel_size=3, activation='relu', name='convolution_layer_3')(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling1D(pool_size=2, name='pooling_3')(x)
x = layers.Dropout(0.2)(x)

x = layers.Conv1D(filters=64, kernel_size=3, activation='relu', name='convolution_layer_4')(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation='relu', name='dense')(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(8, activation='softmax', name='output')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name='base_cnn')

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
Model: "base_cnn"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 Input (InputLayer)          [(None, 128, 147)]        0         
                                                                 
 convolution_layer_1 (Conv1  (None, 126, 512)          226304    
 D)                                                              
                                                                 
 batch_normalization (Batch  (None, 126, 512)          2048      
 Normalization)                                                  
                                                                 
 pooling_1 (MaxPooling1D)    (None, 63, 512)           0         
                                                                 
 convolution_layer_2 (Conv1  (None, 61, 256)           393472    
 D)                                                              
                                                                 
 batch_normalization_1 (Bat  (None, 61, 256)           1024      
 chNormalization)                                                
                                                                 
 pooling_2 (MaxPooling1D)    (None, 30, 256)           0         
                                                                 
 convolution_layer_3 (Conv1  (None, 28, 128)           98432     
 D)                                                              
                                                                 
 batch_normalization_2 (Bat  (None, 28, 128)           512       
 chNormalization)                                                
                                                                 
 pooling_3 (MaxPooling1D)    (None, 14, 128)           0         
                                                                 
 dropout (Dropout)           (None, 14, 128)           0         
                                                                 
 convolution_layer_4 (Conv1  (None, 12, 64)            24640     
 D)                                                              
                                                                 
 flatten_1 (Flatten)         (None, 768)               0         
                                                                 
 dropout_1 (Dropout)         (None, 768)               0         
                                                                 
 dense (Dense)               (None, 512)               393728    
                                                                 
 batch_normalization_3 (Bat  (None, 512)               2048      
 chNormalization)                                                
                                                                 
 output (Dense)              (None, 8)                 4104      
                                                                 
=================================================================
Total params: 1146312 (4.37 MB)
Trainable params: 1143496 (4.36 MB)
Non-trainable params: 2816 (11.00 KB)
_________________________________________________________________
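
The shapes and parameter counts in this summary can be reproduced by hand. The sketch below is only a sanity check using the standard formulas for Conv1D, Dense, and BatchNormalization layers with 'valid' padding and stride 1; the helper names are made up for illustration.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel) plus one bias, per filter
    return (kernel_size * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # One weight per input plus one bias, per output unit
    return (in_units + 1) * out_units

def batchnorm_params(channels):
    # gamma and beta are trainable; moving mean and variance are not
    return 4 * channels

def conv_out_len(length, kernel_size):
    # 'valid' padding with stride 1 trims kernel_size - 1 steps
    return length - kernel_size + 1

length, channels = 128, 147  # input shape from keras.Input above
total = 0
non_trainable = 0

# (filters, followed by batch norm?, followed by max pooling of size 2?)
conv_blocks = [(512, True, True), (256, True, True), (128, True, True), (64, False, False)]
for filters, use_bn, use_pool in conv_blocks:
    total += conv1d_params(3, channels, filters)
    length, channels = conv_out_len(length, 3), filters
    if use_bn:
        total += batchnorm_params(channels)
        non_trainable += 2 * channels
    if use_pool:
        length //= 2

flat = length * channels  # size after Flatten
total += dense_params(flat, 512) + batchnorm_params(512) + dense_params(512, 8)
non_trainable += 2 * 512

print(flat, total, non_trainable)  # 768 1146312 2816
```
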
In [23]:
callbacks = [keras.callbacks.ModelCheckpoint('cnn_augmented.keras', save_best_only=True, monitor='val_loss')]
# The tf.data datasets are already batched, so batch_size is not passed to fit()
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, callbacks=callbacks)
Epoch 1/25
180/180 [==============================] - 7s 31ms/step - loss: 2.1027 - accuracy: 0.2479 - val_loss: 2.7594 - val_accuracy: 0.0799
Epoch 2/25
180/180 [==============================] - 6s 31ms/step - loss: 1.8302 - accuracy: 0.3189 - val_loss: 2.6191 - val_accuracy: 0.1076
Epoch 3/25
180/180 [==============================] - 6s 31ms/step - loss: 1.7357 - accuracy: 0.3483 - val_loss: 2.3662 - val_accuracy: 0.2188
Epoch 4/25
180/180 [==============================] - 5s 30ms/step - loss: 1.6094 - accuracy: 0.3910 - val_loss: 2.0876 - val_accuracy: 0.4062
Epoch 5/25
180/180 [==============================] - 5s 30ms/step - loss: 1.4966 - accuracy: 0.4429 - val_loss: 1.8757 - val_accuracy: 0.4688
Epoch 6/25
180/180 [==============================] - 5s 30ms/step - loss: 1.4130 - accuracy: 0.4707 - val_loss: 1.7945 - val_accuracy: 0.4861
Epoch 7/25
180/180 [==============================] - 5s 30ms/step - loss: 1.3350 - accuracy: 0.5017 - val_loss: 1.8775 - val_accuracy: 0.4549
Epoch 8/25
180/180 [==============================] - 5s 30ms/step - loss: 1.2323 - accuracy: 0.5451 - val_loss: 2.0714 - val_accuracy: 0.4236
Epoch 9/25
180/180 [==============================] - 5s 30ms/step - loss: 1.1659 - accuracy: 0.5700 - val_loss: 1.9651 - val_accuracy: 0.4410
Epoch 10/25
180/180 [==============================] - 5s 30ms/step - loss: 1.1013 - accuracy: 0.5880 - val_loss: 2.3449 - val_accuracy: 0.4410
Epoch 11/25
180/180 [==============================] - 6s 31ms/step - loss: 1.0273 - accuracy: 0.6214 - val_loss: 1.8953 - val_accuracy: 0.5312
Epoch 12/25
180/180 [==============================] - 6s 31ms/step - loss: 0.9589 - accuracy: 0.6552 - val_loss: 1.7928 - val_accuracy: 0.5139
Epoch 13/25
180/180 [==============================] - 6s 32ms/step - loss: 0.9093 - accuracy: 0.6703 - val_loss: 2.0152 - val_accuracy: 0.4722
Epoch 14/25
180/180 [==============================] - 6s 33ms/step - loss: 0.8719 - accuracy: 0.6844 - val_loss: 1.9575 - val_accuracy: 0.4931
Epoch 15/25
180/180 [==============================] - 6s 33ms/step - loss: 0.8112 - accuracy: 0.7035 - val_loss: 2.0537 - val_accuracy: 0.5174
Epoch 16/25
180/180 [==============================] - 6s 32ms/step - loss: 0.7723 - accuracy: 0.7234 - val_loss: 1.9738 - val_accuracy: 0.4896
Epoch 17/25
180/180 [==============================] - 6s 32ms/step - loss: 0.7297 - accuracy: 0.7365 - val_loss: 2.3430 - val_accuracy: 0.4618
Epoch 18/25
180/180 [==============================] - 6s 32ms/step - loss: 0.6975 - accuracy: 0.7507 - val_loss: 2.2241 - val_accuracy: 0.5000
Epoch 19/25
180/180 [==============================] - 6s 32ms/step - loss: 0.6583 - accuracy: 0.7609 - val_loss: 2.1033 - val_accuracy: 0.4722
Epoch 20/25
180/180 [==============================] - 6s 32ms/step - loss: 0.6371 - accuracy: 0.7700 - val_loss: 2.2368 - val_accuracy: 0.4514
Epoch 21/25
180/180 [==============================] - 6s 32ms/step - loss: 0.6028 - accuracy: 0.7847 - val_loss: 2.3910 - val_accuracy: 0.4861
Epoch 22/25
180/180 [==============================] - 6s 32ms/step - loss: 0.5558 - accuracy: 0.8016 - val_loss: 2.1928 - val_accuracy: 0.5382
Epoch 23/25
180/180 [==============================] - 6s 32ms/step - loss: 0.5618 - accuracy: 0.8010 - val_loss: 2.4219 - val_accuracy: 0.4792
Epoch 24/25
180/180 [==============================] - 6s 33ms/step - loss: 0.5363 - accuracy: 0.8083 - val_loss: 2.3705 - val_accuracy: 0.5174
Epoch 25/25
180/180 [==============================] - 6s 33ms/step - loss: 0.4990 - accuracy: 0.8170 - val_loss: 1.9816 - val_accuracy: 0.5556
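
Because the checkpoint callback monitors val_loss with save_best_only=True, the saved file holds the weights from the epoch with the lowest validation loss, which is not necessarily the last epoch or the one with the highest validation accuracy. The sketch below replays the val_loss and val_accuracy values from the log above in plain Python. It assumes the callback treats lower loss as better (the usual behavior for a loss metric), and the helper name is hypothetical.

```python
# Values copied from the training log above, epochs 1 through 25
val_loss = [2.7594, 2.6191, 2.3662, 2.0876, 1.8757, 1.7945, 1.8775, 2.0714,
            1.9651, 2.3449, 1.8953, 1.7928, 2.0152, 1.9575, 2.0537, 1.9738,
            2.3430, 2.2241, 2.1033, 2.2368, 2.3910, 2.1928, 2.4219, 2.3705,
            1.9816]
val_accuracy = [0.0799, 0.1076, 0.2188, 0.4062, 0.4688, 0.4861, 0.4549, 0.4236,
                0.4410, 0.4410, 0.5312, 0.5139, 0.4722, 0.4931, 0.5174, 0.4896,
                0.4618, 0.5000, 0.4722, 0.4514, 0.4861, 0.5382, 0.4792, 0.5174,
                0.5556]

def checkpoint_epochs(losses):
    """Return the 1-based epochs at which a save-best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

saved = checkpoint_epochs(val_loss)
kept = saved[-1]  # the last write is what remains on disk
best_acc_epoch = max(range(1, len(val_accuracy) + 1), key=lambda e: val_accuracy[e - 1])

print(saved)                                            # [1, 2, 3, 4, 5, 6, 12]
print(kept, val_accuracy[kept - 1])                     # 12 0.5139
print(best_acc_epoch, val_accuracy[best_acc_epoch - 1])  # 25 0.5556
```

Under these assumptions the checkpoint comes from epoch 12, whose validation accuracy of about 0.51 is lower than the 0.56 reached in the final epoch.
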
In [24]:
pd.DataFrame(history.history)[['loss', 'val_loss']].plot()
plt.title('Loss By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
In [25]:
pd.DataFrame(history.history)[['accuracy', 'val_accuracy']].plot()
plt.title('Accuracy By Epoch')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
In [26]:
test_model = keras.models.load_model('cnn_augmented.keras')
predicted = test_model.predict(val_dataset)
predicted = np.argmax(predicted, axis=1)+1
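# classification_report expects (y_true, y_pred). The report printed below came from a run
# with the two arguments reversed, so its precision and recall columns are exchanged and
# "support" counts predictions per class rather than true labels. F1 is unaffected.
print(classification_report(val_labels, predicted))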
9/9 [==============================] - 0s 11ms/step
              precision    recall  f1-score   support

           1       0.53      0.38      0.44        24
           2       0.72      0.92      0.81        25
           3       0.48      0.36      0.41        59
           4       0.17      0.64      0.27        11
           5       0.56      0.39      0.46        38
           6       0.58      0.63      0.60        41
           7       0.39      0.44      0.41        41
           8       0.81      0.59      0.68        49

    accuracy                           0.51       288
   macro avg       0.53      0.54      0.51       288
weighted avg       0.56      0.51      0.52       288

Transfer Learning (using the OpenL3 model)¶

This code never ran successfully because my computer ran out of RAM. Even in Google Colab with a smaller batch size, the session crashed before training could start. My hypothesis, however, is that this model would not have outperformed the previous one for emotion classification. OpenL3 embeddings are pre-trained through audio-visual correspondence on video soundtracks (the variant loaded below uses the "music" content type), not on emotional speech, so I doubt the pre-trained features would have helped the model's predictions much.
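
Out-of-memory crashes like this are often easier to diagnose with a quick estimate of how much RAM the input arrays need. The sketch below is only back-of-the-envelope arithmetic: the function name, the count of 1,152 training clips (1,440 files minus the 288 validation clips), and float32 storage are assumptions for illustration. It compares spectrograms with the (128, 147) shape used by the CNN above against mel spectrograms whose time axis has been padded out to 48,000 frames.

```python
def dataset_megabytes(num_examples, example_shape, bytes_per_value=4):
    """Approximate in-memory size of a dense array data set, in megabytes."""
    values_per_example = 1
    for dim in example_shape:
        values_per_example *= dim
    return num_examples * values_per_example * bytes_per_value / 1e6

num_clips = 1152  # assumed size of the un-augmented training split

compact = dataset_megabytes(num_clips, (128, 147))
padded = dataset_megabytes(num_clips, (128, 48000))

print(f"(128, 147) spectrograms:   {compact:,.0f} MB")
print(f"(128, 48000) spectrograms: {padded:,.0f} MB")
```

Under these assumptions the padded version needs tens of gigabytes before any model is loaded, which would be enough on its own to exhaust the memory of a typical laptop or free Colab session.
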

def load_and_process_audio(file_path, label):
    # Assumption: the OpenL3 Keras model takes raw audio (one-second windows at
    # 48 kHz) and computes the mel spectrogram internally, so no spectrogram is
    # computed here.
    target_sr = 48000
    target_length = 48000  # one second of audio at 48 kHz

    audio, _ = librosa.load(file_path, sr=target_sr)

    # Pad with zeros or truncate so every clip has exactly target_length samples
    if len(audio) < target_length:
        audio = np.pad(audio, (0, target_length - len(audio)))
    elif len(audio) > target_length:
        audio = audio[:target_length]

    # Add a leading axis to match the (1, 48000) model input declared below
    audio = np.expand_dims(audio, axis=0)

    return audio, label

train_data = []

for file, label in zip(train_paths, train_labels):
    audio, label = load_and_process_audio(file, label)
    train_data.append(audio)

val_data = []

for file, label in zip(val_paths, val_labels):
    audio, label = load_and_process_audio(file, label)
    val_data.append(audio)

sparse_train_labels = np.zeros((train_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_train_labels[:, i] = (train_labels == value).astype(int)

sparse_val_labels = np.zeros((val_labels.shape[0],8))
for i, value in enumerate(np.arange(1,9)):
    sparse_val_labels[:, i] = (val_labels == value).astype(int)

batch_size = 32

train_dataset = tf.data.Dataset.from_tensor_slices((train_data, sparse_train_labels))
train_dataset = train_dataset.shuffle(buffer_size=len(train_paths)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((val_data, sparse_val_labels))
val_dataset = val_dataset.batch(batch_size)

import openl3

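# Pre-trained OpenL3 audio embedding model (512-dimensional embeddings).
# It gets its own name so the classifier built below does not overwrite it.
embedding_model = openl3.models.load_audio_embedding_model(input_repr="mel256", content_type="music", embedding_size=512)

# Freeze every layer except the last two
for layer in embedding_model.layers[:-2]:
    layer.trainable = False

inputs = keras.Input(shape=(1, 48000), name='Input')
x = embedding_model(inputs)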
outputs = layers.Dense(8, activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name='transfer_model')

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

callbacks = [keras.callbacks.ModelCheckpoint('transfer.keras', save_best_only=True, monitor='val_loss')]
# The tf.data datasets are already batched, so batch_size is not passed to fit()
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25, callbacks=callbacks)

Results¶

The best model I was able to produce for identifying emotions in spoken audio was the convolutional neural network trained on augmented data. It reached 51% accuracy on the validation set. By F1 score, the model was strongest on "calm" (0.81) and "surprised" (0.68), followed by "fearful" (0.60), and weakest on "sad" (0.27).

One step I did not mention at the beginning of this analysis, but which improved every model, was scaling the spectrogram values with the np.sqrt() function. When I first built these neural networks on plain spectrograms, the values were so faint that they barely showed up in the spectrogram images. Scaling seemed to increase the performance of each model by around 10%. Because I am new to audio analysis, I suspect there is more I could do to emphasize these values and bring out additional features in the spectrograms, which would give the models more and better features to use for classification.
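
To illustrate why this kind of scaling helps, the sketch below compares three versions of a synthetic power spectrogram: raw, square-root scaled, and log (decibel) scaled. The data are random stand-ins rather than RAVDESS audio, the rescale helper is hypothetical, and the decibel variant is a common alternative that this analysis did not use (librosa offers it as power_to_db).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic power spectrogram: mostly very small values plus one loud patch,
# which is roughly what makes raw spectrogram images look almost empty.
power = rng.exponential(scale=1e-4, size=(128, 147))
power[40:44, 60:70] = 1.0

def rescale(x):
    """Map an array linearly onto [0, 1], as an image colormap would."""
    return (x - x.min()) / (x.max() - x.min())

raw = rescale(power)
sqrt_scaled = rescale(np.sqrt(power))
# Floor the values before the log to avoid log(0)
log_scaled = rescale(10.0 * np.log10(np.maximum(power, 1e-10)))

for name, image in [("raw", raw), ("sqrt", sqrt_scaled), ("log", log_scaled)]:
    print(f"{name:>4}: median brightness = {np.median(image):.4f}")
```

The median brightness rises from raw to square-root to log scaling, meaning progressively more of the quiet detail becomes visible to a model. Whether the stronger compression would actually improve accuracy here is untested.
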

Data augmentation made a real difference in this analysis: adding a few augmentation techniques improved the performance of the model by around 9%. I knew nothing about audio analysis when I began, and I learned everything I know about audio transformations while writing the code to apply them. There are many other kinds of audio transformations that I could add to the augmentation phase and that might improve the model's performance even more. I would like to explore these techniques further in the future.
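
As a starting point for that exploration, here is a minimal NumPy sketch of three simple waveform-level augmentations: additive noise, a random circular time shift, and a random gain change. These are generic techniques with hypothetical function names and parameter values; they may overlap with, or differ from, the augmentations used earlier in this analysis, and the test signal is a synthetic tone rather than RAVDESS audio.

```python
import numpy as np

def add_noise(audio, noise_level=0.005, rng=None):
    """Add Gaussian noise with standard deviation noise_level."""
    rng = np.random.default_rng() if rng is None else rng
    return audio + noise_level * rng.standard_normal(audio.shape)

def time_shift(audio, max_shift, rng=None):
    """Circularly shift the waveform by up to max_shift samples either way."""
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(audio, shift)

def random_gain(audio, low=0.8, high=1.2, rng=None):
    """Scale the waveform by a random factor drawn from [low, high)."""
    rng = np.random.default_rng() if rng is None else rng
    return audio * rng.uniform(low, high)

rng = np.random.default_rng(42)

# Stand-in clip: one second of a 220 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
clip = 0.5 * np.sin(2 * np.pi * 220 * t)

augmented = [
    add_noise(clip, rng=rng),
    time_shift(clip, max_shift=1600, rng=rng),
    random_gain(clip, rng=rng),
]

for version in augmented:
    print(version.shape, float(np.abs(version).max()))
```

Each function keeps the clip length unchanged, so the augmented waveforms could pass through the same spectrogram step as the originals.
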

In the end, this project taught me a lot about audio analysis and the kinds of models and techniques that can be used to analyze audio. I feel more empowered than ever to explore audio classification again in the future.